Safety Architecture Patterns

These tutorials are a simplified introduction, and are not sufficient on their own to achieve system safety. You are responsible for the safety of your system.

Do not double-spend your redundancy.


Are You Using A Good Safety Pattern?

Carnegie Mellon University

■ Anti-Patterns for Safety:
  • Mixed-SIL software without isolation
  • No redundancy for high criticality functions
  • Fault detection vs. availability confusion

■ Appropriate pattern depends on the system
  • Cross-checked redundancy for fault detection
  • Standby redundancy for availability
  • Separation of Low SIL and High SIL functions
    - Each SIL must have its own isolated CPU
    - For discussion:
      » SIL 1 & SIL 2 are low criticality (e.g., non-fatal injuries)
      » SIL 3 & SIL 4 are life critical; requires same-SIL redundancy
Low SIL

■ Pattern: One Channel (1-of-1)
  • Hardware: single CPU
  • Software: no isolation

■ Pro:
  • Simplest pattern
  • Least expensive hardware
  • Suitable for SIL << hardware failure rate

■ Con:
  • All software promoted to the highest SIL on the CPU
  • Only for low criticality (e.g., SIL 1, 2)
    - Fails “active” (i.e., many failures are unsafe)
    - HW failure rate has to be infrequent compared to SIL requirements

[Figure: one solid box labeled LOW SIL PRIMARY; single CPU at SIL 1 or SIL 2 (inputs/outputs not shown). Note: a solid box is a microcontroller chip.]
Self-Diagnosis

■ Pattern: One Channel (1-of-1) + Built-In-Self-Test
  • Hardware: single CPU
  • Software: additional self-test libraries

■ Pro:
  • Least expensive hardware
  • Suitable for SIL < hardware failure rate
    - Permitted by IEC 60730 with self-test library

■ Con:
  • All software promoted to the highest SIL on the CPU
  • Only for low criticality (e.g., SIL 1, 2)
    - Self-test does not provide high-criticality safety (e.g., SIL 3, 4)
    - Fails “active” (i.e., many failures are unsafe)

[Figure: single CPU at SIL 1 or SIL 2.]
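As a rough illustration of what a self-test library does, the sketch below checks a program image against a reference CRC and runs a write/read-back pattern test over a simulated RAM region. All names (`image_crc_ok`, `ram_pattern_ok`, `self_test`) are hypothetical, the RAM is just a Python `bytearray`, and real self-test libraries are normally vendor-supplied code running on the target. It is a sketch under those assumptions, not a compliant implementation.

```python
import zlib


def image_crc_ok(image: bytes, expected_crc: int) -> bool:
    """Return True if the program image still matches its reference CRC-32."""
    return zlib.crc32(image) == expected_crc


def ram_pattern_ok(ram: bytearray) -> bool:
    """Write and read back two complementary patterns, then restore the contents."""
    saved = bytes(ram)
    ok = True
    for pattern in (0x55, 0xAA):
        for i in range(len(ram)):
            ram[i] = pattern
        # A stuck bit would make at least one cell read back incorrectly.
        ok = ok and all(b == pattern for b in ram)
    ram[:] = saved
    return ok


def self_test(image: bytes, expected_crc: int, ram: bytearray) -> bool:
    """Combined check; a pass means 'no fault found', not 'no fault present'."""
    return image_crc_ok(image, expected_crc) and ram_pattern_ok(ram)
```

A passing self-test only rules out the faults the test happens to exercise, which is why the slide limits this pattern to low criticality.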
Partitioned Low SIL

■ Pattern: One Channel with Software Isolation
  • Hardware: single CPU
  • Software: partitioned Low SIL / Higher SIL

■ Pro:
  • Simplest mixed-SIL pattern
    - In practice, this amounts to using an RTOS for task isolation
  • Relatively inexpensive hardware

■ Con:
  • Requires SIL “isolation argument”
    - e.g., RTOS memory protection, task timing, I/O isolation, ...
  • Only for low criticality (e.g., SIL 1, 2)
    - Fails “active” (i.e., some failures are unsafe)

[Figure: single CPU with software isolation (e.g., mirrored variables) separating SIL 1 software from SIL 2 software. Note: a dotted box is on-chip partitioning.]
Low SIL, Fail Operational

■ Pattern: Two Channel Failover (1-of-2)
  • Hardware: primary CPU and backup CPU
  • Software: no isolation

■ Pro:
  • Simplest high-availability pattern
  • Failover for simple failure modes (low SIL)
    - e.g., loss of heartbeat from Primary

■ Con:
  • All software promoted to the highest SIL on the CPU
  • Requires standby diagnosis
    - e.g., via periodic role reversal and self-test
  • Standby component does not improve SIL
    - Redundancy for availability, not fault detection

[Figure: PRIMARY SYSTEM (Primary Low SIL CPU) with STANDBY (Backup Low SIL CPU); both CPUs at same SIL running same computation.]
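The heartbeat-based failover mentioned above can be sketched as a small monitor that switches to the backup once the primary has been silent for longer than a timeout. The class name `FailoverMonitor`, the timeout value, and the latching behavior are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    timeout_s: float               # assumed maximum gap between heartbeats
    last_heartbeat_s: float = 0.0  # time of the most recent primary heartbeat
    active: str = "primary"

    def heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the primary."""
        self.last_heartbeat_s = now_s

    def poll(self, now_s: float) -> str:
        """Return the active unit, failing over if the primary went silent."""
        if self.active == "primary" and now_s - self.last_heartbeat_s > self.timeout_s:
            # Latched: a late heartbeat does not switch control back.
            self.active = "backup"
        return self.active
```

Note what this does not detect: a primary that keeps sending heartbeats while computing wrong outputs stays in control, which is the sense in which this redundancy buys availability rather than fault detection.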
Voting Architecture

■ Pattern: Triplex Modular Redundancy (2-of-3)
  • Hardware: three primary CPUs plus HW majority voter
  • Software: High SIL

■ Pro:
  • Improves availability without internal testing
    - Any single faulty channel gets outvoted by the majority voter
    - Mismatching unit is most likely the faulty unit
  • This pattern is about improving availability
    - Avoids diagnostic loopholes in failover pattern

■ Con:
  • The voter is a single point of failure
    - High SIL fail-operational voter is challenging!

[Figure: three identical primary CPUs (Channel 1, Channel 2, Channel 3) feeding a voter that is marked as a single point of failure.]
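A minimal sketch of 2-of-3 majority voting follows. It assumes the three channels produce exactly comparable (bit-identical) outputs, which real systems with analog inputs or unsynchronized timing often cannot assume. The function name `vote_2_of_3` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def vote_2_of_3(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], Optional[int]]:
    """Return (majority value, index of the mismatching channel or None).

    Returns (None, None) when no two channels agree.
    """
    if len(outputs) != 3:
        raise ValueError("expected exactly three channel outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        return None, None  # no majority: the voter cannot pick an output
    # The channel that disagrees with the majority is the likely faulty one.
    suspect = next((i for i, v in enumerate(outputs) if v != value), None)
    return value, suspect
```

The voter itself runs exactly once per output, so any fault in this function affects the system regardless of how healthy the three channels are.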
High SIL, Fail Silent

■ Pattern: Two Channel (2-of-2)
  • Hardware: two cross-checked CPUs
    - Includes redundant, cross-checked I/O
  • Software: no isolation

■ Pro:
  • Simplest High-SIL pattern
    - Suitable for life-critical SIL (e.g., SIL 3, 4)

■ Con:
  • All software promoted to the highest SIL on the CPU
    - E.g., if one function is SIL 4, all software must be SIL 4
    - Potentially expensive software development
  • Fails “silent” (stops operation)

[Figure: FAIL-SILENT SYSTEM COMPONENT containing Channel 1 (first SIL 3 CPU) and Channel 2 (second SIL 3 CPU) joined by a cross-check; both CPUs at same SIL running same computation.]
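The cross-check in a 2-of-2 pair can be sketched as follows: both channels compute from the same input, and any disagreement latches the pair into a silent state. The class `FailSilentPair` and the use of `None` to represent "no output" are illustrative assumptions. In a real system the two channels are separate CPUs with cross-checked I/O, not two callables in one process.

```python
from typing import Callable, Optional


class FailSilentPair:
    def __init__(self, channel_a: Callable[[int], int], channel_b: Callable[[int], int]):
        self._a = channel_a
        self._b = channel_b
        self.shut_down = False

    def step(self, sensor_value: int) -> Optional[int]:
        """Return the agreed output, or None once the pair has gone silent."""
        if self.shut_down:
            return None
        out_a, out_b = self._a(sensor_value), self._b(sensor_value)
        if out_a != out_b:
            self.shut_down = True  # latch: stay silent until externally reset
            return None
        return out_a
```

The comparison only catches faults that make the channels differ. A design fault common to both channels produces matching wrong outputs and passes the cross-check.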
High SIL, Fail Operational

■ Pattern: Dual Two Channel (Dual 2-of-2)
  • Hardware: two pairs of cross-checked CPUs
  • Software: no isolation

■ Pro:
  • Simplest high-SIL availability pattern
    - Suitable for life-critical SIL (e.g., SIL 3, 4)
  • Fails operational via hot standby

■ Con:
  • All software promoted to the highest SIL on the CPU
    - Potentially expensive software development
  • Requires ensuring standby is ready to go
    - E.g., via periodic role reversal
    - Periodic off-line self test improves reliability (proof testing)

[Figure: two cross-checked pairs of High SIL CPUs, one pair active and the other (CPU #3 and CPU #4) as HOT STANDBY (FAIL-SILENT), with fail-over on a fault; all CPUs at same SIL running same computation.]

© 2021 Philip Koopman


Ariane 5 Flight 501 Failure

■ June 1996 loss of inaugural flight
  • Also lost $400 million scientific payload

■ Primary/Backup Inertial Reference System
  • Reused from Ariane 4
    - But, Ariane 5 had higher horizontal velocity
    - 64-bit float to 16-bit integer overflow in backup
      ... followed by ...
      the exact same numeric overflow in primary
  • Both processors failed = loss of control

■ Software is a single point of failure
  • Redundant SW fails the same way
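The point that redundant software fails the same way can be illustrated with a small simulation. The conversion below raises an error when a floating-point value does not fit in a signed 16-bit integer, and both "channels" run the identical code on the identical input. The function names and the input values are hypothetical and are not actual flight data or flight software.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_unchecked(x: float) -> int:
    """Convert to a 16-bit integer; raise if the value does not fit.

    The exception loosely stands in for an unhandled conversion error.
    """
    value = int(x)
    if not INT16_MIN <= value <= INT16_MAX:
        raise OverflowError(f"{x} does not fit in 16 bits")
    return value


def run_channel(horizontal_velocity: float) -> str:
    """One inertial reference channel: shuts down on an unhandled error."""
    try:
        to_int16_unchecked(horizontal_velocity)
        return "ok"
    except OverflowError:
        return "shut down"


def run_redundant_pair(horizontal_velocity: float) -> tuple:
    """Backup and primary run the same software on the same input."""
    return run_channel(horizontal_velocity), run_channel(horizontal_velocity)
```

With an in-range input both channels report "ok". With an out-of-range input both shut down together, so hardware redundancy gives no protection against the shared software fault.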
Low SIL Doer-Checker

■ Pattern: Same-CPU Doer/Checker Pair (mostly fail silent)
  • Hardware: single CPU
  • Software: Doer=Low SIL; Checker=Low SIL

■ Pro:
  • RTOS can provide some Doer/Checker isolation
    - Perhaps Checker at SIL 2, Doer at SIL 1
    - Permitted by IEC 60730
  • Might be able to take credit for higher SIL checker

■ Con:
  • Requires Doer/Checker isolation argument
    - Or, Doer and Checker both need to be at the same, higher SIL
  • Only for low criticality (e.g., SIL 1, 2)
    - Fails “active” (i.e., some failures are unsafe)

[Figure: single CPU with software isolation (e.g., mirrored variables) between a LOW SIL PRIMARY (SIL 1 Doer) and a LOW SIL Checker; the Checker's safety check can command shutdown.]
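A Doer/Checker pair can be sketched as a wrapper in which the Doer computes a command and a much simpler Checker only verifies that the command lies inside a safety envelope, latching a shutdown otherwise. The function name `make_checked_doer` and the envelope bounds are hypothetical. The sketch also runs both parts in one Python process, so it provides none of the isolation that the pattern depends on.

```python
from typing import Callable, Optional


def make_checked_doer(doer: Callable[[float], float],
                      low: float, high: float) -> Callable[[float], Optional[float]]:
    """Wrap a Doer with a Checker that shuts down on any out-of-envelope command."""
    state = {"shut_down": False}

    def step(sensor: float) -> Optional[float]:
        if state["shut_down"]:
            return None
        command = doer(sensor)
        # Checker: deliberately simple and independent of the Doer's logic.
        # The negated range test also rejects NaN.
        if not (low <= command <= high):
            state["shut_down"] = True
            return None
        return command

    return step
```

The Checker does not need to know how the Doer works, only what a safe output looks like, which is what allows the two parts to be developed to different levels of rigor.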
Low SIL, Fail Silent Hardware

■ Pattern: Low SIL Doer/Checker Pair
  • Hardware: Primary plus Checker CPU pair
    - Sometimes called an “E-quizzer” pattern; needs I/O checking!
  • Software: Doer=Low SIL; Checker=Low SIL

■ Pro:
  • Hardware isolation between Doer/Checker
    - E.g., SIL 1 Doer, SIL 2 Checker with some SW diversity
  • Can lock down checker image despite Doer updates
  • Non-Desktop OS in Checker could help with security

■ Con:
  • Requires self-test for Checker to ensure it’s alive
  • Only for low criticality (e.g., SIL 1, 2)
    - Checker self-test can’t be perfect; fails “active”

[Figure: LOW SIL PRIMARY (Doer) CPU and LOW SIL CHECKER (SIL 2 Checker) CPU; the Checker's safety check can command shutdown.]





High SIL, Fail Silent (Usually Unsafe)

■ Pattern: Attempted High SIL Doer/Checker Pair
  • Hardware: Primary plus Checker CPU pair
    - Sometimes called a High SIL “E-quizzer” pattern
  • Software: Doer=High SIL; Checker=High SIL

■ Con: Checker can’t be trusted
  • Checker self-test will not find all faults
    - Single fault containment region cannot self-diagnose 100% at SIL 3 or SIL 4
  • Doer cannot detect all possible Checker faults
    - “Sanity checks” and “quizzing” will only find some faults
    - Doer & Checker have different SW, so this is NOT a 2-of-2 pattern!
  • Therefore, Checker will have undetected faults
    - Use for High SIL applications is likely to be unsafe
      » Except for one special case .... see next slide

[Figure: ANY SIL Doer (SIL 1, 2, 3, 4) cross-checked with a HIGH SIL CHECKER; the Checker's safety check can command shutdown.]
High SIL, Fail Silent

■ Pattern: High SIL Doer/Checker with Isolated Checker
  • Hardware: Primary Doer/Checker CPU plus Checker CPU
  • Software: Doer=High SIL; Checker=High SIL
    - Checker #1 exactly models Checker #2 behavior

■ Pro:
  • Fail-silent behavior with simpler checker CPU
    - Potentially suitable for life-critical SIL (e.g., SIL 3)

■ Con:
  • Requires all High-SIL software; fail-silent
    - Must do proof tests as with dual 2-of-2 architecture
    - Must be careful with potentially coupled Doer/Checker #1 faults
  • Requires Doer/Checker software architecture
    - All software must be at the same SIL; mixed SIL is unsafe

[Figure: primary CPU running a HIGH SIL DOER (SIL 3 Doer) plus a HIGH SIL CHECKER (SIL 3 Checker), cross-checked against a HIGH SIL CHECKER (SIL 3 Checker) on a separate CPU; the safety check can command shutdown.]
Mixed SIL, Fail Silent

■ Pattern: Mixed SIL Doer/Checker
  • Hardware: Primary CPU plus 2-of-2 Checker CPU pair
  • Software: Doer=Low SIL; Checker=High SIL

■ Pro:
  • Isolates High SIL software from Low SIL
    - Suitable for life-critical SIL system (e.g., SIL 3, 4)
    - High-SIL Checker is responsible for system safety
  • Only critical software developed at high SIL
    - Enables Low SIL software updates to Doer
    - Checker CPUs can often be small and cheap

■ Con:
  • Fail-silent behavior
  • 3 CPUs instead of 2 for fail-silent system

[Figure: PRIMARY (SIL 1, 2 Doer) monitored by a 2-of-2 CHECKER (FAIL-SHUTDOWN) pair whose safety check can command shutdown.]


Mixed SIL, High Availability

■ Pattern: Mixed SIL Dual Doer/Fail-Stop Checker
  • Hardware: Dual Primary CPU plus 2-of-2 Checker CPU pair
  • Software: Doer=Low SIL; Checker=High SIL

■ Pro:
  • Likely to be less expensive than dual 2-of-2
    - Only critical software developed at high SIL
    - Checker CPUs can often be small and cheap
  • Less likely to have an outage due to Doer fault

■ Con:
  • Need to structure software as Doer/Checker pair
  • Not fail operational!
    - Low SIL Doer software fault can shut down system

[Figure: LOWER SIL PRIMARY (SIL 1, 2 Doer #1) and LOWER SIL STANDBY (SIL 1, 2 Doer #2) monitored by a cross-checked 2-of-2 CHECKER (FAIL-SHUTDOWN) pair (Checker #1 and Checker #2) whose safety check can command shutdown.]
Best Practices For Safety Architecture

■ Consider both HW & SW
  • Doer/checker provides some diversity

■ Use building blocks as appropriate
  • Failover for availability
  • 2-of-2 for same-SIL fault detection
  • Doer/checker for mixed-SIL fault detection

■ Pitfalls:
  • Don’t double-spend redundancy
  • “Clever” shortcuts usually don’t work
  • Avoid single points of failure
    - Don’t forget I/O connection redundancy issues!

[Figure: the three building blocks: 2-of-2 (FAIL-SILENT SYSTEM COMPONENT), Failover (primary with STANDBY), and Doer/Checker (primary with CHECKER (FAIL-SHUTDOWN) and safety check shutdown).]
Acceptable patterns depend upon your safety argument

“IT’S IMPORTANT TO KNOW THE INTERNATIONAL WARNING SYMBOL FOR RADIOACTIVE HIGH-VOLTAGE LASER-EMITTING BIOHAZARDS THAT COAT THE FLOOR AND MAKE IT SLIPPERY.”
https://xkcd.com/2038/


